
Joined Audio-Visual Speech Enhancement and Recognition in the Cocktail Party: The Tug Of War Between Enhancement and Recognition Losses

Pasa, Luca, Morrone, Giovanni, Badino, Leonardo

arXiv.org Machine Learning

In this paper we propose an end-to-end LSTM-based model that performs single-channel speech enhancement and phone recognition in a cocktail party scenario where visual information of the target speaker is available. In the speech enhancement phase the proposed system uses a "visual attention" signal of the speaker of interest to extract her speech from the input mixed-speech signal, while in the ASR phase it recognizes her phone sequence through a phone recognizer trained with a CTC loss. It is well known that learning multiple related tasks from data simultaneously can improve performance compared to learning these tasks independently, so we decided to train the model by optimizing both tasks at the same time. This also allowed us to explore whether (and how) this joint optimization leads to better results. We analyzed different training strategies that reveal some interesting and unexpected behaviors. In particular, the experiments demonstrated that during optimization of the ASR phase the speech enhancement capability of the model significantly decreases, and vice versa. We evaluated our approach on mixed-speech versions of GRID and TCD-TIMIT. The obtained results show a remarkable drop in Phone Error Rate (PER) compared to audio-visual baseline models trained to perform phone recognition only.
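The "tug of war" the abstract describes arises when a single set of parameters is updated against two competing objectives. A minimal sketch of one common way to combine them, a weighted sum of the enhancement loss and the CTC recognition loss, is shown below; the names (`joint_loss`, `alpha`) and the specific weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
def joint_loss(enhancement_loss: float, ctc_loss: float, alpha: float = 0.5) -> float:
    """Weighted sum of the two task losses.

    alpha trades speech-enhancement quality against phone-recognition
    accuracy: alpha = 1.0 optimizes enhancement only, alpha = 0.0
    optimizes ASR only. Intermediate values expose the "tug of war"
    between the two objectives.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * enhancement_loss + (1.0 - alpha) * ctc_loss


# Example: equal weighting of a hypothetical enhancement loss (0.8)
# and CTC loss (1.4).
print(joint_loss(0.8, 1.4, alpha=0.5))
```

In practice the two loss terms would come from the LSTM enhancement output (e.g. a reconstruction error on the target speaker's spectrogram) and from a CTC layer over the phone vocabulary; alternating which term dominates during training is one of the strategies the paper compares.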


Google Develops AI That Can Separate Voices in a Crowd

#artificialintelligence

Google Research engineers have developed a deep learning system that can separate voices from audio-visual data recorded in crowded environments. The system emulates the "cocktail party" effect, a feature of the human brain that can isolate and focus on one or more particular voices in a crowd. The system is designed to work with both audio and video data at the same time. Google says it created its novel tech by feeding it over 100,000 high-quality videos of lectures and talks hosted on YouTube. All talks were given by a single speaker, with minimal background noise. They trained the AI to recognize sounds based on lip/mouth movement.


[D] Open source pre-trained deep learning model for audio source separation (cocktail party)? • r/MachineLearning

@machinelearnbot

A cocktail party involves multiple sources of speech and non-speech. Even if you can isolate speech from non-speech, how to deal with cross-talk is still a whole other issue.


The Lingo That'll Save Your Next Cocktail Party, From 'Rovables' to 'Manthreading'

WIRED

One of the rewards of inventing something new is that you get to name it. The name doesn't always stick; with familiarity, "horseless carriages" tend to become "automobiles" and finally mere "cars." But the original coinage stands as a wonderful snapshot of how we saw the world at a certain moment, flush with delight in new possibilities. And given a chance to make their mark in the lexicon, even the most sober scientists can be gleefully silly: Think of particle physics' quarks and squarks, its muons and gluons. One of the most poetic neologisms of 2016, included in our year-end round-up of the best new words, was "dark sunshine": hypothetical photons generated by (equally hypothetical) dark matter in stars.


The Three Faces of Bayes

#artificialintelligence

Last summer, I was at a conference having lunch with Hal Daume III when we got to talking about how "Bayesian" can be a funny and ambiguous term. It seems like the definition should be straightforward: "following the work of English mathematician Rev. Thomas Bayes," perhaps, or even "uses Bayes' theorem." But many methods bearing the reverend's name or using his theorem aren't even considered "Bayesian" by his most religious followers. Why is it that Bayesian networks, for example, aren't considered… y'know… Bayesian? As I've read more outside the fields of machine learning and natural language processing -- from psychometrics and environmental biology to hackers who dabble in data science -- I've noticed three broad uses of the term "Bayesian."


AI Contextual Reasoning Learning

#artificialintelligence

Artificial Intelligence (AI) has four seasons: hype, disappointment, funding drought, and renewed interest. I've been involved in AI research for quite some time -- I became a fellow of the Association for the Advancement of Artificial Intelligence (AAAI) in 1993 -- and I've weathered several seasonal cycles. What I'm seeing now, however, is the most puzzling cycle yet; either I'm getting old and addled, or the current cycle is unique in its magnitude. In these Big Data days, the big talk about AI's potential reminds me of what happened at the peak of earlier cycles (see, for example, the recent Wall Street Journal article). Once again, the focus is on a single technical component -- deep learning -- and hopes seem to be building that it can solve many very hard problems easily and more or less magically.